Matchtigs: minimum plain text representation of k-mer sets

We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-02968-z.

and in turn enable the exact same set of analyses. There are highly engineered solutions available to compute a compacted de Bruijn graph by computing unitigs from any set of strings in memory [23] or with external memory [33,40,41]. Incidentally, the set of unitigs computed from a set of strings is also a way to store a set of k-mers without repetition, and thus in reasonably small space. However, the necessity to preserve the topology of the graph makes unitigs an inferior choice to represent k-mer sets, as the sum of their length is still far from optimal, and they consist of many separate strings. The possibility to ignore the topology for k-mer-based methods opens more leeway in their representation that can be exploited to reduce the resource consumption of existing and future bioinformatics tools.
The need for such representations becomes apparent when observing the amount of data available to bioinformaticians. For example, the number of complete bacterial genomes available in RefSeq [42] more than doubled between May 2020 and July 2021 from around 9000 [43] to around 21,000. And with the ready availability of modern sequencing technologies, the amount of genomic data will increase further in the next years. In turn, analysing this data requires an ever growing amount of computational resources. But this could be relieved through a smaller representation that reduces the RAM usage and speeds up the analysis tools, and thereby allows to run larger pipelines using less computational resources. To fulfil this goal, a plain text representation would be the most useful: if the representation has to be decompressed before usage, then this likely erases the savings in RAM and/or adds additional runtime overhead. Formally, a plain text representation is a set of strings that contains each k-mer from the input strings (forward, reverse-complemented, or both) and no other k-mer. We denote such a set as a spectrum preserving string set (SPSS). Note that this definition is different from the SPSS defined by Rahman and Medvedev [44], who consider an SPSS to include the additional restriction that each k-mer must be present at most once. Such a plain text representation has the great advantage that some tools (like, e.g. Bifrost's query [23]) can use it without modification. We expect that even those tools that require modifications would not be hard to modify (like, e.g. SSHash [45] which we modified here as an example).

Related work
The concept of storing a set of k-mers in plain text without repeating k-mers to achieve a smaller and possibly simpler representation has recently been simultaneously discovered and named spectrum preserving string sets [without k-mer repetition] by Rahman and Medvedev [44] as well as simplitigs by Břinda, Baym and Kucherov [43]. To avoid confusion with our redefinition of the SPSS, we call this concept simplitigs in our work. Both Rahman and Medvedev and Břinda, Baym and Kucherov propose an implementation that greedily joins consecutive unitigs to compute such a representation. The UST algorithm by Rahman and Medvedev works on the node-centric de Bruijn graph of the input strings and finds arbitrary paths in the graph starting from arbitrary nodes. Each node is visited exactly by one path, and whenever a path cannot be extended forwards (because a dead-end was found, or all successor nodes have been visited already), then a new path is started from a new random node. Before a new path is started this way, if any successor node of the finished path marks the start of a different path, then the two paths are joined. During the traversal, the unitigs of the visited nodes are concatenated (without repeating the k − 1 overlapping characters) and those strings are the final output. Břinda, Baym and Kucherov's greedy algorithm to compute simplitigs (for which the authors provide an implementation under the name ProphAsm [43]) does not construct a de Bruijn graph, but instead collects all k-mers into a hash table. Then it extends arbitrary k-mers both forwards and backwards arbitrarily until they cannot be extended anymore, without repeating any k-mers. The extended k-mers are the final output.
Both heuristics greatly reduce the number of strings (string count, SC) as well as the total amount of characters in the strings (cumulative length, CL) required to store a kmer set. The reduction in CL directly relates to a lower memory consumption for storing a set of strings, but also the reduction in SC is very useful. For example, when storing a set of strings, not only the strings need to be stored, but also some index structure telling where they start and end. This structure can be smaller if less strings exist. There might even be cases where by increasing CL and decreasing SC, the overall size of the representation of the string set (strings + index structure) can be decreased. However, to stay independent of any specific data structure we only optimise CL. Břinda, Baym and Kucherov show that both SC and CL are greatly reduced for very tangled de Bruijn graphs, like graphs for single large genomes with small k-mer length and pangenome graphs with many genomes. Additionally they show merits of using heuristic simplitigs in downstream applications like an improvement in run time of k-mer queries using BWA [46], as well as a reduction in space required when storing heuristic simplitigs compressed with general-purpose compressors over storing unitigs compressed in the same way. Rahman and Medvedev show a significant reduction in SC and CL on various data sets as well, and also show a reduction in space required to store heuristic simplitigs over unitigs when compressed with general-purpose compressors. Khan et al. [40] also provide an overview of using heuristic simplitigs for various genomes, including also a human gut metagenome.
The authors of both papers also give a lower bound on the cumulative length of simplitigs, and show that their heuristics achieve representations with a cumulative length very close to the lower bound for typical values of k (31 for bacterial genomes and 51 for eukaryotic genomes). Břinda, Baym and Kucherov also experiment with lower values of k ( < 20 for bacterial genomes and < 30 for eukaryotic genomes) which make the de Bruijn graph more dense to almost complete, and show that in these cases, their heuristic does not get as close to the lower bound as for larger values of k. Further, the authors of both papers consider whether computing minimum simplitigs without repeating k-mers might be NP-hard. This has recently been disproven by Schmidt and Alanko [47], and in fact simplitigs with minimum cumulative length can be computed in linear time by using a subset of the matchtig algorithm. Their algorithm constructs a bidirected arccentric de Bruijn graph in linear time using a suffix tree, and then Eulerises it by inserting breaking arcs. It then computes a bidirected Eulerian circuit in the Eulerised graph and breaks it at all breaking arcs. The strings spelled by the resulting walks are the optimal simplitigs, named Eulertigs. Specifically, they leave out all parts from the matchtigs algorithm that relate to concatenating unitigs by repeating k-mers, and instead only concatenate consecutive unitigs in an optimal way. In line with the previous results about the lower bounds [43,44], Eulertigs are only marginally smaller than the strings computed by previous heuristics. All these suggest that no further progress is possible when k-mer repetitions are not allowed in a plain text representation.
There are already tools available that use simplitigs. The compacted de Bruijn graph builder cuttlefish2 [40] has an option to output simplitigs instead of maximal unitigs. A recent proposal for a standardised file format for k-mer sets explicitly supports simplitigs [48]. Also the k-mer dictionary SSHash [45] uses simplitigs to achieve a smaller representation and to answer queries more efficiently. Here, the higher efficiency is achieved both by reducing the space required to store the k-mers themselves, but also due to the lower string count reducing the size of the index data structures on top. Further, a recent proposal to index genomic sequences as opposed to k-mer sets works with simplitigs without modification [49], and with minor extra book-keeping also for general SPSSs. In that work, the size of the SPSS is very minor compared to the size of the index, however, major components of the index may be smaller if the SPSS contains less strings, which can be achieved by using greedy matchtigs. Our algorithms were also integrated into the external-memory de Bruijn graph compactor GGCAT [41], which was easy to do (source: personal communication with Andrea Cracco).
For compressing multiple k-mer sets, in [50] the authors use an algorithm similar to ProphAsm and UST that separates the unique k-mer content of each set from the k-mer content shared with other sets. For compressing k-mer sets with abundances, the plain text representation REINDEER [51] uses substrings of unitigs with k-mers of similar abundance, called monotigs. In the wider field of finding small representations of k-mer sets that are not necessarily in plain text, there exists for example ESSCompress [52], which uses an extended DNA alphabet to encode similar k-mers in smaller space.

Our contribution
In this paper we propose the first algorithm to find an SPSS of minimum size (CL). Moreover, we show that a minimum SPSS with repeated k-mers is polynomially solvable, based on a many-to-many min-cost path query and a min-cost perfect matching approach. We further propose a faster and more memory-efficient greedy heuristic to compute a small SPSS that skips the optimal matching step, but still produces close to optimal results in CL, and even better results in terms of SC.
Our experiments over references and read datasets of large model organisms and bacterial pangenomes show that the CL decreases by up to 26% and the SC by up to 90% over UST or ProphAsm (on larger datasets, sometimes UST cannot be run because BCALM2 cannot be run, and sometimes ProphAsm cannot be run because it does not use external memory). Compared to unitigs, the CL decreases by up to 59% and SC by up to 97%. These improvements come often at just minor costs, as computing our small representation (which includes a run of BCALM2) takes less than twice as long than computing unitigs with BCALM2, and takes less than 37% longer in most cases. Even if the memory requirements for large read datasets increase, they stay within the limits of a modern server.
Finally we show that besides the smaller size of a minimum SPSS, it also has advantages in downstream applications. As an example of a k-mer-based method, we query our compressed representation with the tools SSHash [45] and Bifrost [23]. These are state-of-the-art tools supporting k-mer-based queries in genomic sequences, using a representation of a k-mer set as a set of unitigs. By simply replacing unitigs with the strings produced by our greedy heuristic, and without modifications to Bifrost and a minor modification to SSHash disabling features that require unique k-mers, we get a speedup of up to 4.26× over unitigs, and up to 2.10× over strings computed by UST and ProphAsm. We call the modified version of SSHash "SSHash-Lite".

Basic graph notation
We give detailed definitions for our notation below in the "Preliminaries" section, but give an intuition about the required notation for the results section here already. Note that our notation deviates from standard mathematical bidirected graph notations, but it is useful in practice as it allows to implement bidirected graphs on top of standard graph libraries. We assume that the reader is familiar with the general concept of de Bruijn graphs.
Our bidirected graphs are arc-centric bidirected de Bruijn graphs. Arc-centric de Bruijn graph means that k-mers are on the arcs, and nodes represent k − 1 overlaps between k-mers. We represent the bidirected graph as doubled graph, i.e. by having a separate forward and reverse arc for each k-mer and a separate forward and reverse node for each k − 1 overlap. In this graph, binodes are ordered pairs (v, v −1 ) of nodes that are reverse complements of each other, and biarcs are ordered pairs (e, e −1 ) of arcs that are reverse complements of each other. Two biarcs (e, e −1 ) and (f , f −1 ) are consecutive if the normal arcs e and f are consecutive, i.e. the f leaves the node entered by e. A biwalk is a sequence of consecutive biarcs. If a biwalk visits a biarc, then it is considered to be covering both directions of the biarc. See Fig. 1a for an example.

Matchtigs as a minimum plain text representation of k-mer sets
We introduce the matchtig algorithm that computes a character-minimum SPSS for a set of genomic sequences. While former heuristics (ProphAsm, UST) did not allow to repeat k-mers, our algorithm explicitly searches for all opportunities to reduce the character count in the SPSS by repeating k-mers. Consider for example the arc-centric de Bruijn graph in Fig. 2a. When representing its k-mers without repetition as in Fig. 2b, we need 43 characters and 7 strings. But if we allow to repeat k-mers as in Fig. 2d, we require only 39 characters and 5 strings. It turns out that structures similar to this Fig. 1 Examples of de Bruijn graphs. Binodes are surrounded by a dashed box, where self-complemental binodes contain only one graph node. a A de Bruijn graph of the strings TTC TGA and GTC TGT . The colored and patterned lines are an arc-covering set of biwalks. b A de Bruijn graph containing two self-complemental nodes. The colored and patterned lines are a set of biwalks that visit each biarc exactly once. c a de Bruijn graph containing a self-complemental biarc. The bold colored line is a biwalk that visits each biarc exactly once example occur often enough in real genome graphs to yield significant improvements in both character and string count of an SPSS.
Similar to previous heuristics, our algorithm works on the compacted bidirected de Bruijn graph of the input sequences. However, we require an arc-centric de Bruijn graph, but this can be easily constructed from the node-centric variant (see the "Building a compacted bidirected arc-centric de Bruijn graph from a set of strings" section). In this graph, we find a min-cost circular biwalk that visits each biarc at least once, and that can jump between arbitrary nodes at a cost of k − 1 . This formulation is very similar to the classic Chinese postman problem [53], formulated as follows: find a min-cost circular walk in a directed graph that visits each arc at least once. This similarity allows us to adapt a classic algorithm from Edmonds and Johnson that solves the Chinese postman problem [54] (the same principle was applied in [55]). They first reduce the problem to finding a min-cost Eulerisation via a min-cost flow formulation, and then further reduce that to min-cost perfect matching using a many-to-many min-cost path query between unbalanced nodes. In a similar work [56], the authors solve the Chinese postman problem in a bidirected de Bruijn graph by finding a min-cost Eulerisation via a min-cost flow formulation. As opposed to [54,55] and us, in [56] the authors propose to solve the mincost flow problem directly with a min-cost flow solver. We believe this to be infeasible for our problem, since the arbitrary jumps between nodes require the graph in the flow formulation to have a number of arcs quadratic in the number of nodes.
Our resulting algorithm is polynomial but while it runs fast for large bacterial pangenomes, it proved practically infeasible to build the matching instance for very large genomes ( ≥ 500Mbp). This is because each of the min-cost paths found translates into roughly one edge in the matching graph, and the number of min-cost paths raises quadratically if the graph gets denser. Thus, our algorithm ran out of memory when constructing it for larger genomes, and for those where we were able to construct the matching instance, the matcher itself suffered from integer overflows, since it uses 32-bit integers to store the instance. Hence, for practical purposes, we introduce a greedy heuristic to compute approximate matchtigs. This heuristic does not build the complete instance of the matching problem, but just greedily chooses the shortest path from each unbalanced node to Eulerise the graph. This reduces the amount of paths per node to at most one, and as a result, the heuristic uses significantly less memory, runs much faster, and achieves near optimal speedups when run with multiple threads (see Additional file 1: Supplemental figure S1). While it can in theory produce suboptimal results as in Fig. 2c, in practice, the size of the greedily computed strings is very close to that of matchtigs, and the number of strings is always smaller.
Moreover, the minimality of matchtigs allows us to exactly compare, for the first time, how close heuristic algorithms to compute simplitigs are to optimal SPSS (on smaller genomes and on bacterial pangenomes, due to the resource-intensiveness of optimal matchtigs).
Our implementations are available on GitHub (https:// github. com/ algbio/ match tigs) as both a library and a command line tool, both written in Rust. They support both GFA and fasta file formats with special optimisations for fasta files produced by BCALM2 or GGCAT. Additionally, our implementations support gzip-compressed input and output, as well as outputting an ASCII-encoded bitvector of duplicate k-mers.

Compression of model organisms
We evaluate the performance of our proposed algorithms on three model organisms: C. elegans, B. mori, and H. sapiens. We benchmark the algorithms on both sets of short reads (average length 300 for C. elegans and B. mori, and 296 for H. sapiens) and reference genomes of these organisms. On human reads, we filter the data during processing so that we keep only k-mers that occur at least 10 times (min abundance = 10). This is because with a min abundance of 1, H. sapiens has 114 billion unique k-mers. This extreme k-mer count causes ProphAsm and our tool to run out of memory even with 2TiB of RAM.
We use the metrics cumulative length (CL) and string count (SC) as in [43]. The CL is the total number of characters in all strings in the SPSS, and the SC is the number of strings. We evaluate our algorithms against the same large genomes as in [43], using both the reference genome and a full set of short reads of the respective species (see Table 1 for the results). Since UST as well as matchtigs and greedy matchtigs require unitigs as input, and specifically UST needs some extra information in a format only output by BCALM2 [33], we run BCALM2 to compute unitigs from the input strings. We chose k = 31 , as it is commonly used in k-mer-based methods. While for larger genomes, larger k are used as well, we use the value k = 31 throughout the main matter to allow for easier comparison between results. Furthermore, for all data sets but the C. elegans reference, the matchtigs algorithm ran out of the given 256GiB memory, so we only compute greedy matchtigs for those.
On read data sets where we keep all k-mers, our greedy heuristic achieves an improvement of up to 26% CL and 82% SC over the best competitor (UST-tigs). The human read data set has smaller improvements, however it was processed with a min abundance of 10, yielding longer unitigs with less potential for compression. On reference genomes, the improvement in CL is smaller with up to 7%; however, the improvement in SC is much larger with up to 90%.
For C. elegans, where computing matchtigs is feasible as well, we observe that they yield no significant improvement in CL, but are even slightly worse in SC than the greedy heuristic. The greedy heuristic actually optimises SC more than the optimal matchtigs algorithm. That is because the matching instance in the optimal algorithm is built to optimise CL, and whenever joining two strings does not alter CL, the choice is made arbitrarily. On the other hand, the greedy heuristic makes as many joins as possible, as long as a join does not worsen the CL. This way, the greedy heuristic actually prioritises joining two strings even if it does not alter the CL. For more details, see the "Solving the min-cost integer flow formulation with min-cost matching" and "Efficient computation Table 1 Quality and performance of compressing model organisms We chose k = 31 and a min abundance of 10 for H. sapiens reads and 1 for all others. The CL and SC ratios are between compressed strings and unitigs, and in parentheses are the ratios between our algorithm and the best competitor. B2 means BCALM2. For time and memory, we report the total time and maximum memory required to compute the tigs from the respective data set. BCALM2 directly computes unitigs and ProphAsm directly computes heuristic simplitigs. UST, GREEDY and MATCH compute heuristic simplitigs, greedy matchtigs and matchtigs from unitigs. The number in parentheses behind time and memory indicates the slowdown/increase over computing just unitigs with BCALM2. All algorithms were run with 28 threads, except for UST which supports only one thread (the preceding run of BCALM2 was still executed with 28 threads), and ProphAsm, which supports only one thread as well. Matchtigs are too expensive to run on all genomes except for the C. elegans reference, and ProphAsm takes too much time on H. sapiens reads, especially since it does not support minimum abundance. The lengths of the genomes are 100Mbp for C. elegans, 482Mbp for B. mori and 3.21Gbp for H. sapiens and the read data sets have a coverage of 64× for C. elegans, 58x for B. mori and 300× for H. sapiens. The unique k-mer counts of the read datasets are 1. 35  of the greedy heuristic" sections. See Additional file 1: Supplemental figure S2 for more quality measurements with different k-mer size and min. abundance. We assume that the improvements correlate inversely with the average length of maximal unitigs of the data set. Our approach achieves a smaller representation by joining unitigs with overlapping ends, avoiding the repetition of those characters. This has a natural limit of saving at most k − 1 characters per pair of unitigs joint together, so at most k − 1 characters per unitig. In turn, the maximum fraction of characters saved is bound by k − 1 divided by the average length of unitigs. In Additional file 1: Supplemental figure S2, we have varied the k-mer size and min. abundance for our data sets to vary the average length of unitigs. This gives us visual evidence for a correlation between average unitig length and decrease in CL.
Our improvements come at often negligible costs in terms of time and memory. Even for read sets, the run time at most doubles compared to BCALM2 in the worst case. However, the memory consumption rises significantly for read sets. This is due to the high number of unitigs in those graphs and the distance array of Dijkstra's algorithm, whose size is linear in the number of nodes and the number of threads. See Additional file 1: Supplemental figure S3 for more performance measurements with different k-mer size and min. abundance.

Compression of pangenomes
In addition to model organisms with large genomes, we evaluate our algorithms on bacterial pangenomes of N. gonorrhoeae, S. pneumoniae, E. coli, and Salmonella, as well as a human pangenome. We use the same metrics as for model organisms. For the bacterial genomes, we choose k = 31 , but also for the human genome for the reasons argued above, and also for easier comparability of the results on the different genomes. We show the results in Table 2. See Additional file 1: Supplemental figure S4 for more quality measurements with different k-mer size and min. abundance, and Additional file 1: Supplemental figure S5 for more performance measurements with different k-mer size and min. abundance. In neither of them, we have included Salmonella or human, as they take too much time.
Our algorithms improve CL up to 19% (using greedy matchtigs) over the best competitor and SC up to 70% (using greedy matchtigs). Matchtigs always achieve a slightly lower CL and slightly higher SC than greedy matchtigs, but the CL of greedy matchtigs is always at most 2% worse than that of matchtigs. We again assume that the improvements are correlated inversely to the average size of unitigs, as suggested by the experiments in Additional file 1: Supplemental figure S4. These improvements come at negligible costs, using at most 15% more time and 11% more memory than BCALM2 when computing greedy matchtigs, except for the large salmonella pangenome, which took 50% more memory. The higher memory consumption is due to the graph being more tangled due to the high number of genomes in the pangenome. For matchtigs, the time increases by less than a factor of three and memory by at most 12% compared to BCALM2. See Additional file 1: Supplemental figure S5 for more performance measurements with different k-mer size and min. abundance.

k-mer-based short read queries
Matchtigs have further applications beyond merely reducing the size required to store a set of k-mers. Due to their smaller size and lower string count, they can make downstream applications more efficient. To make a concrete example, in this section we focus on membership queries. As already explained, each SPSS (unitigs, UST-tigs, matchtigs, etc.) can be considered as a (multi-) set of k-mers. Given a k-mer, a membership query is to verify whether the k-mer belongs to the set or not. We focus on exact queries, rather than approximate, i.e. if a k-mer does not belong to the set then the answer to the query must be "false". Assessing the membership to the set for a string Q longer than k symbols is based on the answers to its constituent k-mers: only if at least ⌊θ × (|Q| − k + 1)⌋ k-mers of Q belongs to the set, then Q is considered to be present in the set. The threshold θ is therefore an "inclusion" rate, which we fix to 0.8 for the experiments in this section.

Table 2 Quality and performance of compressing pangenomes
We chose k = 31 and a min abundance of 1. The CL and SC ratios are between compressed strings and unitigs, and in parentheses are the ratios between our algorithm and the best competitor. B2 means BCALM2. For time and memory, we report the total time and maximum memory required to compute the tigs from the respective data set. BCALM2 directly computes unitigs and ProphAsm directly computes heuristic simplitigs. UST, GREEDY and MATCH compute heuristic simplitigs greedy matchtigs and matchtigs from unitigs. The number in parentheses behind time and memory indicates the slowdown/increase over computing just unitigs with BCALM2. All algorithms were run with 28 threads, except for UST which supports only one thread (the preceding run of BCALM2 was still executed with 28 threads), and ProphAsm, which supports only one thread as well. The N. gonorrhoeae pangenome contains 8.36 million unique k-mers, the S. pneumoniae pangenome contains 19.3 million unique k-mers, the E. coli pangenome contains 341 million unique k-mers, the Salmonella pangenome contains 657 million unique k-mers and the human pangenome contains 2.8 billion unique k-mers. Due to its size, ProphAsm and MATCH could not be run on the Salmonella pangenome. Also due to size, BCALM2 did not run on the human pangenome, hence we used Cuttlefish 2. To still be able to compare against competitors, we ran ProphAsm on the unitigs produced by Cuttlefish 2 (UST requires extra information from BCALM2). To let Cuttlefish 2 run faster, we have used the flag -unrestricted-memory. Hence, its memory consumption is a lot higher than that of BCALM2 To support fast membership queries in compressed space, we build an SSHash-Lite dictionary over each SPSS. SSHash-Lite is a relaxation of SSHash [45,57] in that it supports membership queries without requiring each k-mer to appear once in the underlying SPSS. It is available at https:// github. com/ jermp/ sshash-lite. In short, SSHash is a compressed dictionary for k-mers -based on minimal perfect hashing [58] and minimizers [59] -which, for an input SPSS without duplicates and having n (distinct) k-mers, assigns to each k-mer in the input a unique integer number from 0 to n − 1 by means of a Lookup query. The result of Lookup for any k-mer that is not contained in the input SPSS is −1 . Therefore, SSHash serves the same purpose of a minimal perfect hash function over a SPSS but it is also able to reject alien k-mers. Two variants of SSHash were proposed -a regular and a canonical one. The canonical variant uses some extra space compared to the regular one but queries are faster to answer. (For all further details, we point the reader to the original papers [45,57].) Now, to let SSHash be able to query SPSSs with possible duplicate k-mers (e.g. matchtigs), it was only necessary to modify the return value of the Lookup query to just return "true" if a k-mer if found in the dictionary rather than its unique integer identifier (respectively, "false" if a k-mer is not found instead of −1 ). Therefore, SSHash-Lite can be directly used to index and query the unitigs, UST-tigs, and matchtigs as well.
We compare the performance of SSHash-Lite when indexing unitigs, UST-tigs, and matchtigs in Table 3. We build the SPSSs from three datasets: a ~309k× Salmonella Enterica pangenome; a 300× coverage human short read dataset filtered to exclude k-mers with an abundance lower than 10; and a 2505× human pangenome. The Salmonalla pangenome was queried with 3 million random Salmonella short reads with lengths between 70 and 502, and an N75 of 302. The human queries for both the human read dataset and the human pangenome are 3 million random short reads (296 bases each) from the human read dataset.
We see that matchtigs improve the performance of membership queries in both space and time compared to unitigs and UST-tigs. While the difference is more evident when compared to unitigs, matchtigs also consistently outperform UST-tigs -achieving the lowest space usage and faster query time across almost all combinations of dataset and index variant (regular/canonical).
Note again that the speed up in searching time is more evident on the human reads dataset since it is much larger than the Salmonella pan-genome and it is generally less evident for the canonical index variant of SSHash-Lite because it is approximately 2× faster to query than the regular one. Remarkably, regular SSHash-Lite over matchtigs achieves 26 − 59% reduction in space over unitigs while being also 4.26× faster to query on the human reads datasets. Compared to UST-tigs instead, matchtigs still retain 2.10× faster query time while improving space by up to 8% . These results were achieved on a typical bioinformatics compute node with many logical cores (256) and a large amount of RAM (2TB). In Additional file 1: Supplemental table S2, we performed the same experiment on a server with focus on single-thread performance, achieving slightly smaller improvements.
The reduction in index space when indexing matchtigs is to be attributed to the lower string count and fewer nucleotides in the collection. The speedups achieved by SSHash-Lite when indexing matchtigs instead of unitigs can be explained as follows. When querying, SSHash-Lite streams through the k-mers of the query. At the beginning, the tig containing the first k-mer of the query is determined using a minimal perfect hash function over the minimizers of the input SPSS, as well as the position of the k-mer in the tig. For the subsequent k-mers of the query, SSHash-Lite attempts to "extend" the matching of the k-mer against the identified tig by just comparing the single nucleotide following the previous k-mer in the tig. Extending a match in this way is extremely fast not only because just a single nucleotide needs to be compared but also because it is a very cache-friendly algorithm, dispensing random accesses to the index entirely. However, each time an extension is not possible (either because we have a mismatch or we have reached the end of the current tig) a "full" new search is made in the index. The search consists in evaluating the minimal perfect hash function and locating the k-mer inside another tig. Clearly, a search is much more expensive due to cache misses compared to an extension. Now, using longer tigs with a lower tig count -the case for the matchtigs -increases the chance of extension, or equivalently, decreases the number of full searches in the index. Compared to UST-tig, matchtigs can be faster to query exactly because allowing repeated k-mers to appear in the tigs further helps in creating Table 3 Performance characteristics of querying different tigs with SSHash-Lite SSHash-Lite is run with k = 31 and a k-mer-inclusion rate of 0.8. On the Salmonella pan-genome, we used a minimizer length of 17 for the regular index and a minimizer length of 16 for the canonical index. On the human reads, we used a minimizer length of 20 for the regular index and a minimizer length of 19 for the canonical index. On the human pangenome, we used a minimizer length of 19 for the regular index and a minimizer length of 20 for the canonical index. The search speedup is with respect to unitigs, and the search speedup in parentheses is with respect to the strings computed by UST. Index time is the end-to-end time required to build the SSHash-Lite index: it includes reading the collections from disk and building the data structure using external memory. Searching time is the time required to check which reads have at least 80% of their k-mers in the input SPSS. The number in parentheses under the genome is the k-mer hitrate, i.e. the fraction of k-mers from the query that are part of the queried dataset opportunities for extension. Therefore, by reducing the number of full searches, we can reduce the overall runtime of the query. See Additional file 1: Section 3 for a similar query experiment with the slightly older tool Bifrost.

Discussion
k-mer-based methods have found wide-spread use in many areas of bioinformatics over the past years. However, they usually rely on unitigs to represent the k-mer sets, since they can be computed efficiently with standard tools [23,33,40,41]. Unitigs have the additional property that the de Bruijn graph topology can easily be reconstructed from them, since they do not contain branching nodes other than on their first and last kmer. However, this property is not usually required by k-mer-based methods, which has opened the question if a smaller set of strings other than unitigs can be used to represent the k-mer sets. If such a representation was in plain text, it should be usable in most k-mer-based tools, by simply feeding it to the tool instead of unitigs.
Previous work has relaxed the unitig requirement of the representation of the k-mer sets to arbitrary strings without k-mer repetitions. This resulted in a smaller representation, leading to improvements in downstream applications. Additionally, previous work considered whether that finding an optimal representation without repeated k-mers is NP-hard, which was then disproven and shown to be linear-time solvable. We have shown that by allowing repetitions, there is a polynomial optimal algorithm that achieves better compression and improvements in downstream applications.

Conclusions
Our optimum algorithm compresses the representation significantly more than previous work. For practical purposes, we also propose a greedy heuristic that achieves near-optimum results, while being suitable for practical purposes in runtime and memory. Specifically, our algorithms achieve a decrease of 26% in size and 90% in string count over UST. Additionally, we have shown that our greedy representation speeds up downstream applications, giving an example with a factor of 2.10 compared to previous compressed representations.
Our implementation is available as a stand-alone command-line tool and as a library. We hope that our efficient algorithms result in a wide-spread adoption of near-minimum plain-text representations of k-mer sets in k-mer-based methods, resulting in more efficient bioinformatics tools.

Methods
We first give some preliminary definitions in the "Preliminaries" section and define our problem in the "Problem overview" section. Note that to stay closer to our implementation, our definitions of bidirected de Bruijn graphs differ from those in e.g. [56]. However, the concepts are fundamentally the same. Then, in the "Building a compacted bidirected arc-centric de Bruijn graph from a set of strings", "Reduction to the bidirected partial-coverage Chinese postman problem", "Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows", "Solving the min-cost integer flow formulation with min-cost matching", and "Efficient computation of many-to-many min-cost paths" sections, we describe how to compute matchtigs. The whole algorithm is summarised by an example in Fig. 3. For simplicity, we describe the algorithm using an uncompacted de Bruijn graph. However, in practice it is much more efficient to use a compacted de Bruijn graph, but our algorithm can be adapted easily: simply replace the Fig. 3 An example of the matchtigs algorithm. A The input genomic sequences. B We first build an arc-centric compacted de Bruijn graph (for simplicity, the reverse complements of the nodes and arcs are not shown). C In the graph we compute the bi-imbalances of the nodes (the difference between outdegree and indegree). D From each node with negative bi-imbalance we compute the min-cost paths to all reachable nodes with positive bi-imbalance. The costs of each arc are the amount of characters required to join two strings from the negative to the positive node while repeating the k-mers between the nodes. Specifically, the costs of an arc are |s| − (k − 1) , where |s| is the length of its label. E Using a min-cost perfect matching instance built from the min-cost paths, we decide which bi-imbalances should be fixed by repeating k-mers. The blue/tightly dashed edges are joining edges stemming from the min-cost paths. The red edges in longer dashes indicate that a node should stay unmatched, i.e. that fixing its bi-imbalance requires breaking arcs. The solution edges are highlighted in bold. There is one node in the matching problem for each binode in the original graph. The nodes x ′ are not reverse complements of nodes x, but stem from a reduction that makes a copy of each node. For more details, refer to the "Solving the min-cost integer ow formulation with min-cost matching" section. F For each joining edge in the solution we insert a joining arc into the DBG (in blue, small dashes), always directed such that the overall bi-imbalance decreases. The remaining imbalance is removed by inserting arbitrary breaking arcs (in read, longer dashes). G We compute a biEulerian circuit in the balanced graph. H We break the biEulerian circuit at all breaking arcs. I We output the strings spelled by the broken walks costs of 1 for each original arc with the number of uncompacted arcs it represents. In the "Efficient computation of the greedy heuristic" section, we describe the greedy heuristic.

Preliminaries
We are given an alphabet Ŵ and all strings in this work have only characters in Ŵ . Further, we are given a bijection comp : Ŵ → Ŵ . The reverse complement of a string s is s −1 := rev(comp * (S)) where rev denotes the reversal of a string and comp * the character-wise application of comp . For an integer k, a string of length k is called a k-mer. From here on, we only consider strings of lengths at least k, i.e. strings that have at least one k-mer as substring. We denote the prefix of length k − 1 of a k-mer s by pre(s) and its suffix of length k − 1 by suf (s) . The spectrum of a set of strings S is defined as the set of all k-mers and their reverse complements that occur in at least one string s ∈ S , formally spec k (S) := {r ∈ Ŵ k | ∃s ∈ S : r or r −1 is substring of s}.
An arc-centric de-Bruijn graph (or short de-Bruijn graph) DBG k (S) = (V, E) of order k of a set of strings S is defined as a standard directed graph with nodes V := {s | s ∈ spec k−1 (S)} and arcs E := {(pre(s), suf (s)) | s ∈ spec k (S)} . On top of this, we use the following notions of bidirectedness. An ordered pair of reverse-complementary nodes [v, v −1 ] ∈ V × V is called a binode and an ordered pair of reversecomplementary arcs [(a, b), (b −1 , a −1 )] ∈ E × E is called a biarc. Even though these pairs are ordered, reversing the order still represents the same binode/biarc, just in the other direction. A node v is called canonical if v is lexicographically smaller than v −1 , and an arc (a, b) is called canonical if the k-mer corresponding to (a, b) is lexicographically smaller or equal to the k-mer corresponding to (b −1 , a −1 ) . If an arc or a node is its own reverse-complement (called self-complemental), then it is written as biarc [(a, b)] or binode [v]. See Fig. 1 for examples of different bigraphs.
Since de Bruijn graphs are defined as standard directed graphs, we use the following standard definitions. The set of incoming (outgoing) arcs of a node is denoted by E − (v) ( E + (v) ), and the indegree (outdegree) is d − (v) := |E − (v)| ( d + (v) := |E + (v)| ). A walk in a de Bruijn graph is a sequence of adjacent arcs (followed in the forward direction) and a unitig is a walk in which all inner nodes (nodes with at least two incident walk-arcs) have exactly one incoming and one outgoing arc. The length |w| of a walk w is the length of the sequence of its arcs (counting repeated arcs as often as they are repeated). A compacted de-Bruijn graph is a de Bruijn graph in which all maximal unitigs have been replaced by a single arc. A circular walk is a walk that starts and ends in the same node, and a Eulerian circuit is a circular walk that contains each arc exactly once. A graph that admits a Eulerian circuit is Eulerian.
Assuming the complemental pairing of nodes and arcs defined above, we can define the following bidirected notions of walks and standard de Bruijn graph concepts. Biwalks and circular biwalks are defined equivalently to walks, except that they are sequences of biarcs. A biwalk w in a de Bruijn graph spells a string spell(w) of overlapping visited k-mers. That is, spell(w) is constructed by concatenating the string a from w's first biarc [(a, b)]) with the last character of b of the first and all following biarcs. See Fig. 1 −1 , v] . We assume that our graph is connected, as on multiple disconnected components, our algorithm can be executed on each component, yielding a minimum result.

Problem overview
We are given a set of input strings I where each string has length at least k, and we want to compute a minimum spectrum preserving string set, defined as follows.

Definition 1
A spectrum preserving string set (or SPSS) of I is a set S of strings of length at least k such that spec k (I) = spec k (S) , i.e. both sets of strings contain the same k-mers, either directly or as reverse complement.
Note that our definition allows k-mers and their reverse complements to be repeated in the SPSS, both in the same string and in different strings. This is an important difference to the definition of an SPSS by Rahman and Medvedev [44]. Their definition is equivalent to simplitigs, defined by Břinda, Baym and Kucherov [43]; hence, we use the term simplitigs when we refer to an SPSS without k-mer repetitions in this paper. Simplitigs are an SPSS such that for each present k-mer the reverse complement is only present if the k-mer is self-complemental, and the k-mer is only present once in at most one string of the SPSS.

Definition 2
The size ||S|| of an SPSS S is defined as where |s| denotes the length of string s. A minimum SPSS is an SPSS of minimum size.
On a high level, our algorithm works as follows (see also Fig. 3).
1 Create a bidirected de Bruijn graph from the input strings (see the "Building a compacted bidirected arc-centric de Bruijn graph from a set of strings" section). 2 Compute the bi-imbalances of each node (see the "Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows" section). 3 Compute the min-cost bipaths of length at most k − 1 from all nodes with negative bi-imbalance to all nodes with positive bi-imbalance (see the "Efficient computation of many-to-many min-cost paths" section). 4 Solve a min-cost matching instance with costs for unmatched nodes to choose a set of shortest bipaths with minimum cost by reduction to min-cost perfect matching (see the "Solving the min-cost integer flow formulation with min-cost matching" section). 5 BiEulerise the graph with the set of bipaths as well as arbitrary arcs between unmatched nodes.
6 Compute a biEulerian circuit in the resulting graph (see the "Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows" section). 7 Break the circuit into a set of biwalks and translate them into a set of strings, which is the output minimum SPSS (see the "Reduction to the bidirected partial-coverage Chinese postman problem" section).
Note that in our implementation, a substantial difference is that we do not build the de Bruijn graph ourselves, but we expect the input to be a de Bruijn graph already. For our experiments, we use a compacted de Bruijn graph computed with BCALM2. We motivate the reasons for optimality while explaining our algorithm, but also give a formal proof in Additional file 1: Section 4.

Building a compacted bidirected arc-centric de Bruijn graph from a set of strings
When building the graph we first compute unitigs from the input strings using BCALM2. Then we initialise an empty graph and do the following for each unitig: 1 We insert the unitig's first k − 1-mer and its reverse complement as binode by inserting the two nodes separately and marking them as a bidirected pair, if it does not already exist. The existence is tracked with a hashmap, storing the two nodes corresponding to a k-mer and its reverse complement if it exists. 2 We do the same for the last k − 1-mer of the unitig. 3 We add a biarc between the two binodes by inserting one forward arc between the forward nodes of the binodes, and one reverse arc between the reverse complement nodes of the binodes. The forward arc is labelled with the unitig, and the reverse arc is labelled with its reverse complement.
To save memory, we store the unitigs in a single large array, where each character is encoded as two-bit number. The keys of the hashmap and the labels of the arcs are pointers into the array, together with a flag for the reverse complement. Nodes do not need a label, as their label can be inferred from any of its incident arcs' label. Recall that in the description of our algorithm, we use an uncompacted graph only for simplicity.

Reduction to the bidirected partial-coverage Chinese postman problem
We first compute the arc-centric de-Bruijn graph DBG k (I) of the given input string set I as described in the "Building a compacted bidirected arc-centric de Bruijn graph from a set of strings" section. In DBG k (I) , an SPSS S is represented by a biarc-covering set of biwalks W (the reverse direction of a biarc does not need to be covered separately).
That is a set of biwalks such that each biarc is element of at least one biwalk (see Fig. 1a).
According to the definition of spell , the size of S is related to W as follows: Each walk costs k − 1 characters because it contains the node a from its first biarc [(a, b)]), and it additionally costs one character per arc.

Definition 3
(Graph transformation) Given an arc-centric de-Bruijn graph DBG k (I) = (V , E) , the transformed graph is defined as DBG ′ k (I) = (V , E ′ ) where E ′ is a multiset defined as E ′ := E ∪ (V × V ) . In E ′ , arcs from E are marked as non-breaking, and arcs from V × V are marked as breaking arcs. The cost function c(e), e ∈ E ′ assigns all nonbreaking arcs the costs 1 and all breaking arcs the costs k − 1.
The reverse complements of breaking arcs are breaking as well, and the same holds for non-breaking arcs. This means that biarcs always are either a pair of reverse complemental breaking arcs, in which case we call them breaking biarcs, or a pair of reverse complemental non-breaking arcs, in which case we call them non-breaking biarcs. By the same pattern, self-complemental biarcs are defined to be breaking biarcs or non-breaking biarcs depending on their underlying arc. Breaking arcs have the costs k − 1 because each breaking arc corresponds to starting a new walk, which by Eq. 1 costs k − 1.
In the transformed graph we find a circular biwalk w * of minimum cost that covers at least all original biarcs (to cover a biarc it is enough to traverse it once in one of its directions), as well as at least one breaking biarc. The reason for having at least one breaking biarc is that later we break the circular original-biarc-covering biwalk at all breaking biarcs to get a set of strings. If there was no breaking biarc, then we would get a circular string. Simply breaking a circular string at an arbitrary binode or a repeated non-breaking biarc does not produce a minimum solution in general, because there may be multiple circular original-biarc-covering biwalks with minimum costs, but with different repeated k-mers. When breaking the walk by removing a longer sequence of repeated k-mers, the resulting string gets shorter, the more repeated k-mers get removed. Hence we make finding an optimal breaking point part of the optimisation problem. We define such a walk as:

Definition 4
Given a transformed graph DBG ′ k (I) = (V , E) , a circular original-biarc-covering biwalk is a circular biwalk w such that for each non-breaking arc (a, b) ∈ E there is a biarc [(a, b), [(a, b)] in w. Additionally, w needs to contain at least one breaking biarc.  c((a, b)), and the costs of a self-complemental biarc [(a, b)] are c((a, b)).
This is similar to the directed Chinese postman problem (DCPP). In the DCPP, the task is to find a circular min-cost arc-covering walk in a directed graph. It is a classical problem, known to be solvable in O(n 3 ) time [60] with a flow-based algorithm, using e.g. [61] to compute min-cost flows. The partial coverage variant of the DCPP (requiring to cover only a subset of the arcs) is also known as the rural postman problem [62]. Further, the bidirected variant of the DCPP was discussed before in [56], and the authors also solved it using min-cost flow in O(n 2 log 2 (n)) time.
We break the resulting min-cost circular original-biarc-covering biwalk w * at all breaking arcs (hence we require it to contain at least one breaking biarc, otherwise we would get a circular string). The returned SPSS is minimum, because the metric optimised when finding w * matches Eq. 1.
Solving the bidirected partial-coverage Chinese postman problem with min-cost integer flows Edmonds and Johnson [54] introduced a polynomial-time flow-based approach that is adaptable to solve our variant of the DCPP. They show that finding a minimum circular arc-covering walk in a directed graph is equivalent to finding a minimum Eulerisation of the graph, and then any Eulerian circuit in the Eulerised graph. A Eulerisation is a multiset of arc-copies from the graph that makes the graph Eulerian if added, either by connecting nodes with missing outgoing arcs directly to nodes with missing incoming arcs, or by connecting them via a path of multiple arcs. A minimum Eulerisation is one whose sum of arc costs is minimum among all such multisets. To find such a minimum cost set of arcs, they formulate a min-cost flow problem as an integer linear program as follows: The variable x e is interpreted as the amount of flow through arc e, and the variable c e denotes the costs for assigning flow to an arc e. The costs are equivalent to the arc costs in the weighted graph specified by the DCPP instance. Objective (2) minimises the costs of inserted arcs as required. To ensure that the resulting flow can be directly translated into added arcs, Condition (3) ensures that the resulting flow is non-negative and integral. Lastly, Eq. (4) is the balance constraint, ensuring that the resulting flow is a valid Eulerisation of the graph. This constraint makes nodes with missing outgoing arcs into sources, and nodes with missing incoming arcs into sinks, with demands matching the number of missing arcs. Note that in contrast to classic flow problems, this formulation contains no capacity constraint. For a solution of this linear program, the corresponding Eulerisation contains x e copies of each arc e.
To adapt this formulation to our variant of the DCPP, we need to make modifications, namely: • compute the costs while treating self-complemental biarcs the same as other biarcs, • allow for partial coverage, • force cover at least one breaking arc (one of the arcs that are not required to be covered), • adjust the balance constraint for biwalks and • ensure that the resulting flow is bidirected, i.e. the flow of each arc equals the flow of its reverse complement.
(2) min e∈E c e x e s.t.

Bidirected costs
If we would simply count the costs of each arc separately, then self-complemental biarcs would cost 1 for each repetition, while other biarcs would cost 2 for each repetition, since other biarcs consist for two arcs. To circumvent this, we only count arcs corresponding to canonical k-mers in the cost function:

Partial coverage
In the partial coverage Chinese postman problem, we are additionally given a set F ⊆ E of arcs to be covered. In contrast to the standard DCPP, a solution walk only needs to cover all the arcs in F. In our case, the set F is the set of original arcs of the graph before Eulerisation. To solve the partial coverage Chinese postman problem we define outgoing covered arcs The resulting set of arcs is a minimum Eulerisation of the graph (V, F), and a Eulerian walk in this graph is equivalent to a minimum circular F-covering walk in the original graph.

Cover one breaking arc
Next to the partial coverage condition, we additionally require to cover at least one of the arcs that is not required to be covered. Since we forbid negative flows, we can express this as:

Bidirected balance
In contrast to Edmonds and Johnson, we are interested in a minimum circular biwalk that covers each original biarc. Analogue to the formulation for directed graphs, we make the following definitions:

Definition 6
(BiEulerian circuits and graphs) A biEulerian circuit in a bidirected graph is a bidirected circuit that visits each biarc exactly once. A biEulerian graph is a graph that admits a biEulerian circuit. A biEulerisation is a multiset of biarc-copies from the graph that makes a graph biEulerian if added. A minimum biEulerisation is one whose sum of arc costs is minimum among all biEulerisations of the same graph. ∀v ∈ V : e∈(E\F ) x e ≥ 1.
We can compute a biEulerisation in the same way as we compute a Eulerisation, the only change is in the balance constraint. Observe that for a Eulerian graph, the imbal- is zero for each node [63], because each node is entered exactly as often as it is exited. For binodes, the definition of the bi-imbalance bi v of a binode [v, v −1 ] or [v] follows the same idea. However, in contrast to directed graphs, there are two special cases. These are mutually exclusive, since the labels of an arc are of length k and those of a node are of length k − 1 , such that only one can have even length, which is required to be self-complemental in DNA alphabets.
Binodes  For self-complemental binodes [v] ∈ V , there is no concept of incoming or outgoing biarcs, since any biarc can be used to either enter or leave [v] (see Fig. 1b for an example). Therefore, for balance, biarcs need to be paired arbitrarily, for which to be possible there needs to be an even amount of biarcs. If there is an odd amount, then there is one unpaired biarc, hence the bi-imbalance is 1. The following condition exerts this behaviour (using mod as the remainder operator): Finally, we include partial coverage to above bi-imbalance formulations by limiting the incoming and outgoing arcs to F. Further, to distinguish between selfcomplemental nodes and others, we denote the set of self-complemental nodes as S ⊆ V and the set of binodes that are not self-complemental as T := V \ S . If selfcomplemental biarcs are included in the flow, then these alter the bi-imbalance by two, in the same way as they do in the equation of the bi-imbalance. We encode this on the left side of the equation. Then we get the following modified coverage constraint:

Valid bidirected flow
To adapt Edmonds' and Johnson's formulation to biwalks, we additionally need to ensure that the resulting flow yields a set of biarcs, i.e. that each arc has the same flow as its reverse complement:

Adapted flow formulation
With the modifications above, we can adapt the formulation of Edmonds and Johnson to solve the bidirected partial-coverage Chinese postman problem. We define F to be the arcs in the original graph, and set E := V × V . We further set c e = 1 for e ∈ F and c e = k − 1 otherwise. Lastly we define S and T as above. Then we get the following modified formulation. Note that it is conceptually similar to that proposed in [56], however different because the basic definitions differ, and we further allow for special arcs of which only one needs to be covered.
(9) ∀e ∈ E : x e = x e −1 In this min-cost integer flow formulation of the bidirected partial-coverage Chinese postman problem, analogue to the formulation of Edmonds and Johnson, sources and sinks are nodes with missing outgoing or incoming arcs, with demands matching the number of missing arcs in F. Our formulation would not be solvable for practical de-Bruijn graphs because inserting a quadratic amount of arcs into the graph is infeasible. However, most of the breaking arcs are not needed, since in a minimum solution they can only carry flow if they directly connect a source to a sink, by the following argument: Imagine a breaking arc that carries flow but is connected to a source or sink with at most one end. We can trace one unit of flow on the arc to a source and a sink, creating a path of flow one. By removing the flow from the path, and adding it to a breaking arc directly connecting the source to the sink, we get a valid flow. This flow has lower costs than the original, because it contains the same amount of breaking arcs, but a lower number of non-breaking arcs. This can be repeated until only breaking arcs that directly connect sources to sinks are left.
But even reducing the number of breaking arcs like this might not be enough if the graph contains too many sources and sinks. We therefore reduce the linear program to a min-cost matching instance, similar to Edmonds and Johnson.

Solving the min-cost integer flow formulation with min-cost matching
To solve the bidirected partial-coverage Chinese postman problem with min-cost matching, we observe that flow traverses the graph from a source to a sink only via mincost paths, since all arcs have infinite capacity. Due to the existence of the breaking arcs with low costs ( k − 1 ), we can further restrict the flow to use only paths of length at most k − 2 without affecting minimality. However, since we are also interested in a low number of strings in our minimum SPSS, we also allow paths of length k − 1 . We can precompute these min-cost paths efficiently in parallel (see "Efficient computation of many-to-many min-cost paths" section below). Then it remains to decide which combination of min-cost paths and breaking arcs yield a minimum solution.
To simplify this problem, observe that the pairing of sources and sinks that are connected via breaking arcs does not matter. Any pairing uses the same amount of breaking arcs, and therefore has the same costs. It only matters that these nodes are not connected by a lower-cost path that does not use any breaking arcs, and that there is at least one breaking arc. As a result, we can ignore breaking arcs when searching a minimum x e ≥ 1, and x e mod 2 = 0, and ∀e ∈ E ∶ x e = x e −1 (9) solution, and instead introduce costs for unmatched nodes. However, we still need to enforce that there is at least one pair of unmatched nodes. We do this using a special construction described below. Note though, that there can only be unmatched nodes if there are unbalanced binodes, i.e. the graph was not biEulerian in the beginning. However, if the graph is biEulerian already, the whole matching step can be left out, and instead a biEulerian circuit with one arbitrarily inserted breaking biarc can be returned. So we can safely assume here that the graph contains at least one pair of unbalanced binodes (it cannot contain a single unbalanced binode, see e.g. [47]). We formulate a min-cost matching problem with penalty costs for unmatched nodes, which can be reduced to a min-cost perfect matching problem. For the construction of our undirected matching graph M we define the set of sources A ⊆ T as all nodes with negative bi-imbalance, and the set of sinks B ⊆ T as all nodes with positive bi-imbalance. Then we add |bi v | (absolute value of the bi-imbalance of v) copies of each node from A, B and S to M. Further, for each min-cost path from a node a ∈ A ∪ S to a node b ∈ B ∪ S we add an edge (undirected arc) from each copy of a to each copy of b in M with costs equal to the costs of the path. We ignore self loops at nodes in S since they do not alter the imbalance, and nodes in A and B cannot have self loops.
Then, to fulfil Condition (9) (valid bidirected flow) and to reduce the size of the matching problem, we merge all nodes and arcs with their reverse complement (the unmerged graph is built here to simplify our explanations, in our implementation we directly build the merged graph). This additionally results in self-complemental biarcs forming self-loops in the merged graph, thus making them not choosable by by the matcher. But this is correct, as self-complemental biarcs alter the bi-imbalance of a binode by two, and therefore they can only be chosen by matching two different copies of the same binode in M.
Additionally, to fulfil Condition (6) (cover one bidirected arc), we add a pair of nodes u, w to M. We connect u and w to each node in M (but do not add an edge between u and w) and assign costs 0 to all those edges. This forces u and w to be matched to other nodes u ′ , w ′ , which means that when biEulerising, the bi-imbalances of u ′ and w ′ need to be fixed with at least one breaking arc.
Lastly, we assign each node other than u and w penalty costs of (k − 1)/2 for staying unmatched, as each pair of unmatched nodes produces costs k − 1 for using a breaking arc.
We reduce M to an instance of the min-cost perfect matching problem using the reduction described in [64]. For that we duplicate the graph, and add an edge with costs k − 1 between each node and its copy, but not between v and w and their respective copies. The costs for edges between a node and its copy are double the costs of keeping a node unmatched, because by choosing such an edge causes two nodes to stay unmatched.
After this reduction, we use Blossom V [65] to compute a solution. Since all nodes were doubled in the reduction, we actually get two solutions that might even differ, however both of them are minimum. We arbitrarily choose one of the solutions. This gives us a multiset of arcs that we complete with the breaking arcs required to balance the unmatched nodes to create a biEulerisation of the input graph. Following the approach from Edmonds and Johnson, we find a biEulerian circuit in the resulting graph which is a solution to the bidirected partial-coverage Chinese postman problem as required.
Note that our matching formulation only optimises CL, but does not optimise SC. It indirectly optimises SC because it decreases CL by joining strings, which also decreases SC by one each time. However, when joining two strings would not alter CL, Blossom V may output both variants, with joining the strings and without, while staying optimal. It then chooses an arbitrary option.

Efficient computation of many-to-many min-cost paths
Apart from solving the matching problem, finding the min-cost paths between sources and sinks is the most computationally heavy part of our algorithm.
We solve it using Dijkstra's shortest path algorithm [66] in a one-to-many variant and execute it in parallel for all sources. To be efficient, we create a queue with blocks of source nodes, and the threads process one block at a time. A good block size balances between threads competing for access to the queue, and threads receiving a very imbalanced workload. Since our min-cost paths are short (at most k − 1 arcs), in most executions of Dijkstra's algorithm only a tiny fraction of the nodes in the graph are visited. But the standard variant of Dijkstra's algorithm wastefully allocates an array for all nodes to store their distance from the source node (the distance array). To save space, we instead use a hashmap, mapping from node_index to distance from source. This turned out to be faster than using a distance array, even if the distance array uses an epoch system to only do a full reset every 2 32 queries. An epoch system stores a second value for each entry in the distance array indicating in what execution of Dijkstra's algorithm that value is valid. The execution counter gets incremented each execution, and only when it wraps around, the distance array is reset normally. As another optimisation, we abort the execution early when Dijkstra reaches costs greater than k − 1 , since we are only interested in paths up to costs k − 1.
Finally, in our implementation, we do not compute the actual sequences of arcs of the paths. Instead of copying the path arcs when biEulerising the graph, we insert special dummy arcs with a length equal to the length of the path. When breaking the final biEulerian circuit, if there are no breaking arcs but dummy arcs, then we break at a longest dummy arc to produce a minimum solution. If there are neither breaking nor dummy arcs, we proceed as described above. Then, when reporting the final set of strings, we define spell(·) to append the last ℓ characters of b when encountering a dummy biarc [(a, b), (b −1 , a −1 )] (or [(a, b)]) of length ℓ.

Efficient computation of the greedy heuristic
The greedy heuristic biEulerises the graph by greedily adding min-cost paths between unbalanced nodes, as opposed to choosing an optimal set via min-cost matching like our main algorithm. It then continues like the main algorithm, finding a biEulerian circuit, breaking it into walks and spelling out the strings.
To be efficient, the min-cost paths are again computed in parallel, and we apply all optimisations from "Efficient computation of many-to-many min-cost paths" section. The parallelism however poses a problem for the greedy computation: if a binode with one missing incoming biarc is reached by two min-cost paths in parallel, then if both threads add their respective biarcs, we would overshoot the bi-imbalance of that binode. To prevent that, we introduce a lock for each node, and before inserting a biarc into the graph we lock all (up to) four incident nodes. By locking the nodes in order of their ids we ensure that no deadlock can occur. Since the number of threads is many orders of magnitude lower than the number of nodes, we assume that threads almost never need to wait for each other. In addition to the parallelisation, we abort Dijkstra's algorithm early when we have enough paths to fix the imbalance for the binode. This sometimes requires to execute Dijkstra's algorithm again if a potential sink node was used by a different thread in parallel. But again, since the number of threads is many orders of magnitude lower than the number of nodes, we assume that this case almost never occurs.
In practice, the greedy heuristic achieves better results in terms of SC than the optimal matchtigs algorithm (see the "Results" section). This is because the greedy heuristics always joins two strings if it does not alter CL, while the optimal algorithm does not, as explained in "Solving the min-cost integer ow formulation with min-cost matching" section.

Minimising string count
In the paper we studied SPSSes of minimum total length (minimum CL). In this section, we note that an SPSS with a minimum number of strings (minimum SC), and with no constraint on the total length, is also computable in polynomial time.
The high-level idea, ignoring reverse complements, is as follows. Given the arc-centric de Bruijn graph G, construct the directed acyclic graph G * of strongly connected components (SCCs) of G. In G * , every SCC is a node, and we have as many arcs between two SCCs as there are pairs of nodes in the two SCCs with an arc between them. Clearly, all arcs in a single SCC are coverable by a single walk. Moreover, for two SCCs connected by an arc, their two covering walks can be connected via this arc into a single walk covering all arcs of both SCCs. Thus, the minimum number of walks needed to cover all arcs of G * (i.e. minimum SC SPSS) equals the minimum number of paths needed to cover all arcs of the acyclic graph G * . This is a classic problem solvable in polynomial time with network flows (see e.g. [67] among many).
However, such an SPSS of minimum SC very likely has a large CL, because covering an SCC with a single walk might repeat quadratically many arcs, and connecting the covering walks of two adjacent SCCs might additionally require to repeat many arcs to reach the arc between them.